Introduction to State Machine Replication
Learn about state machine replication—a general framework for building fault-tolerant distributed services.
We'll cover the following
Motivation#
Providing fault-tolerant services to the clients is a desirable property of a system. State machine replication (SMR) is a mechanism to implement fault-tolerant services. SMR models a system as a state machine and replicates multiple copies of these state machines such that failures are independent (meaning one failure only impacts one state machine). These state machines start with the same initial state, and the subsequent clients' requests reach every replica in the same order, which applies those commands to arrive at the same new state of the state machine (we are assuming deterministic logic that transitions the state machine from one state to the next).
The core component of SMR is the atomic broadcast facility, which enables every state machine to get the commands in the same order. NIST explains the purpose of state machine replication (SMR) as follows: “The objective of state machine replication (SMR) is to emulate a centralized service in a distributed, fault-tolerant fashion.” In this chapter, we will learn how to implement SMR when Byzantine and fail-stop failures are possible and how to reconfigure a replica group by excluding the faulty nodes.
Replication is a widely used technique to design fault-tolerant distributed systems where we maintain replicas of data or services. Fault tolerance is a way to achieve higher availability and reliability. To tackle different kinds of failures, the replicated sites should be on different physical servers. Such servers might be in different data centers, which might be far apart.
Example: Replicated database
Let's take the example of a database. A database has many data structures to efficiently facilitate clients' read and write requests. These data structures might constitute the state of the database. We can replicate this database at multiple places by replicating its state. When a client wants to read or write to the database, such a command is broadcast to all the replicas. Each replica, on receiving such requests in a specific order, executes it on the local state machine, and the state machine transitions from one state to the next. When all such requests have been applied, all the state machines should have the same state.
Note: Remember that we assume deterministic transition logic so that all replicas could end up at the same state. By taking the majority vote of the outputs of the replicas, we could detect the one which had faults and didn't reach a correct next state.
Before we define fault tolerance in detail, let's review different kinds of failure models.
Node failures#
A component is considered faulty when its behavior is inconsistent with its specification. The entire spectrum of failures considered in distributed systems is covered by its two ends.
Byzantine failures#
A Byzantine failure occurs in a node when it exhibits arbitrary behavior with other nodes, with the possibility of appearing to be working fine. It is not always possible to detect Byzantine nodes in a replica group.
1 of 3
2 of 3
3 of 3
Fail-stop failures#
A fail-stop failure occurs in a node when its response to failure is to change to a state that reliably indicates its failure to other nodes.
1 of 2
2 of 2
Points to ponder
Question 2
Why didn’t we discuss possible network failures and only discuss node failures?
The atomic broadcast subcomponent of SMR that broadcasts the clients’ commands in the same order to all the replicas in the group needs to handle all the possible network failures. Therefore, it is not true that SMR is not concerned about network failures.
2 of 2
Tolerating failures#
The motivation behind replication to improve fault tolerance is that a system with a single centralized server is as fault-tolerant as that server. For higher fault tolerance, we require replicas of servers that fail independently. This means that the failure of a replica should not depend on the failure of another. As a result, the system's failure is not contingent on one node failing but on many nodes failing. But how many nodes are allowed to fail before the system fails? We need a way to specify fault tolerance.
Traditionally, fault tolerance is specified as the mean time between failures (MTBF). MTBF is an operational metric. Teams strive for maximum MTBF. While MTBF is a good way to measure fault tolerance, we can also specify fault tolerance in terms of the number of node or replica failures in a system, which comes with its advantages.
A system is considered to be
This approach makes explicit assumptions about the reliability of a system which the MTBF approach does not. MTBF being average can be affected adversely by very high or very low values.
A system's
fault tolerance is unrelated to its components' reliability. This approach provides a measure of fault tolerance provided by the system's architecture rather than its components.
Note: As a designer, knowing that a service can tolerate, for example, three independent node failures is more informative than saying that the MTBF of a service is two years.
Bird's eye view#
The term state machine comes from finite state machines (or finite state automata). Our state machines in SMR can also have a finite amount of states like finite state automata. Before delving deeper into SMR, we’ll review some concepts regarding a state machine in the next lesson.
Quiz on Two-Phase Commit
State Machines